Optimize MongoDBExportPartitionSupplier for uniform _id type collections #6910

Merged: dinujoh merged 1 commit into opensearch-project:main from dinujoh:main on Jun 9, 2026
@dinujoh dinujoh commented Jun 7, 2026

Member

Description

For collections whose _id values all share one type, replace the $or query with a simple Filters.gt("_id", value) when finding partition boundaries. This lets DocumentDB use a single B-tree index seek instead of a multi-index scan.

Changes:

  • Add isUniformIdType() that checks first/last doc _id types
  • Add buildNextStartFilter() with simple $gt for uniform types, falling back to $or-based query for mixed types
  • Use fresh Filters.gte() + skip() per iteration for partition end
  • Extract addPartition() helper to reduce duplication
  • Make BsonHelper.isClassNumber() public for numeric type grouping

Performance: 14M docs (10GB) partitioned in ~30 seconds.
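
The boundary walk described in the changes above can be illustrated without a database. The following is only a sketch, not the code in this PR: a NavigableSet of longs stands in for the _id index, tailSet(start, true) plays the role of Filters.gte() plus skip(), and higher(end) plays the role of Filters.gt(). The class name, method name, and partitionSize parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical in-memory model of the partition boundary walk.
final class PartitionBoundarySketch {

    // Returns inclusive {start, end} pairs covering every id in sorted order.
    static List<long[]> partition(final NavigableSet<Long> ids, final int partitionSize) {
        if (partitionSize <= 0) {
            throw new IllegalArgumentException("partitionSize must be positive");
        }
        final List<long[]> ranges = new ArrayList<>();
        Long start = ids.isEmpty() ? null : ids.first();
        while (start != null) {
            // Models a fresh "gte start, then skip" lookup for the partition end.
            // If fewer than partitionSize ids remain, the last id seen is the end.
            Long end = start;
            final Iterator<Long> it = ids.tailSet(start, true).iterator();
            for (int i = 0; i < partitionSize && it.hasNext(); i++) {
                end = it.next();
            }
            ranges.add(new long[] {start, end});
            // Models the simple "gt end" lookup for the next partition start.
            start = ids.higher(end);
        }
        return ranges;
    }

    public static void main(final String[] args) {
        final NavigableSet<Long> ids = new TreeSet<>();
        for (long i = 1; i <= 10; i++) {
            ids.add(i);
        }
        for (final long[] range : partition(ids, 4)) {
            System.out.println(range[0] + ".." + range[1]);
        }
    }
}
```

Each iteration does one seek for the end and one seek for the next start, which is the access pattern the simple $gt filter is meant to enable on a real index.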

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions Bot commented Jun 7, 2026

✅ License Header Check Passed

All newly added files have proper license headers. Great work! 🎉

import static org.mockito.Mockito.when;

@ExtendWith(MockitoExtension.class)
public class MongoDBExportPartitionSupplierIsUniformIdTypeTest {

Member

Following existing conventions, add an underscore for clarity: MongoDBExportPartitionSupplier_IsUniformTypeTest. Also, make this class package-private (remove the public modifier).

* If uniform, we can use a simple Filters.gt() instead of the complex $or query across all BSON types.
*/
boolean isUniformIdType(final MongoCollection<Document> col) {
final Document first = col.find().projection(ID_PROJECTION).sort(ID_ASC).limit(1).first();

Member

Can these two be combined to avoid two network calls?

Member Author

DocumentDB doesn't support the $facet aggregation stage, so the first and last documents can't be fetched in one query. Both queries are indexed _id lookups (ascending limit 1, descending limit 1), and each takes <1ms.
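
For context on why checking only the first and last documents can be enough: MongoDB's documented comparison order sorts values by type bracket before value and treats the numeric types as one bracket, so if the smallest and largest _id fall in the same bracket, every _id between them does too. The sketch below assumes DocumentDB orders _id the same way. It is not the PR's implementation: the class and method names are hypothetical, java.lang.Number stands in for the grouping that BsonHelper.isClassNumber() provides, and treating a missing document as non-uniform is an assumption.

```java
import java.math.BigDecimal;

// Hypothetical stand-in for the first/last _id type comparison.
final class UniformIdTypeSketch {

    // Collapses all numeric classes into one group; other types group by class name.
    static String typeGroup(final Object id) {
        if (id instanceof Number) {
            return "number";
        }
        return id.getClass().getName();
    }

    // Assumption: an empty collection (null first or last _id) is reported as not uniform.
    static boolean isUniform(final Object firstId, final Object lastId) {
        if (firstId == null || lastId == null) {
            return false;
        }
        return typeGroup(firstId).equals(typeGroup(lastId));
    }

    public static void main(final String[] args) {
        // A double and a decimal count as the same group, as in the PR's test with 3.14 and 99.99.
        System.out.println(isUniform(3.14, new BigDecimal("99.99")));
        // An integer and a string do not.
        System.out.println(isUniform(1, "a"));
    }
}
```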

final Object gteValue = startDoc.get("_id");
final String gteClassName = gteValue.getClass().getName();

final Document endDoc = col.find(Filters.gte("_id", gteValue))

Member

Maybe name this endOfPageDoc or something similar for clarity.

.thenReturn(new Document("_id", 3.14))
.thenReturn(new Document("_id", Decimal128.parse("99.99")));
assertThat(supplier.isUniformIdType(collection), is(true));
}

Member

Maybe also include test cases for a real number type like double and for an integer type.


// isUniformIdType: col.find() called twice (first asc, last desc)
// then col.find() for last doc when endDoc is null
when(col.find()).thenReturn(uniformCheckFirst, uniformCheckLast, lastDocIterable);

Member

It would be better to use whenAnswer and look at the input to determine which value to return. As written, this couples the call order here to the call order in the implementation, and that coupling need not exist.

@dlvenable dlvenable left a comment

Member

Thanks @dinujoh for this contribution! This looks like a good performance improvement.

For collections with uniform _id types, replace the 8-clause $or query
with a simple Filters.gt("_id", value) for finding partition boundaries.
This allows DocumentDB to use a single B-tree index seek instead of
multi-index scan.

Changes:
- Add isUniformIdType() that checks first/last doc _id types
- Add buildNextStartFilter() with simple $gt for uniform types,
  falling back to $or-based query for mixed types
- Use fresh Filters.gte() + skip() per iteration for partition end
- Extract addPartition() helper to reduce duplication
- Make BsonHelper.isClassNumber() public for numeric type grouping

Performance: 14M docs (10GB) partitioned in ~30 seconds.

Signed-off-by: Dinu John <86094133+dinujoh@users.noreply.github.com>
@dinujoh dinujoh merged commit 2bd162c into opensearch-project:main Jun 9, 2026
71 of 72 checks passed